Visium HD Analysis

Author

Alex Bartlett, Meeta Mistry, Noor Sohail, and Will Gammerdinger


Approximate time: 2 hours and 45 minutes

Learning Objectives

  • Describe the elements of the Seurat object and how they are generated
  • Visually inspect and compare spatial scRNA-seq data before and after filtering
  • Execute clustering workflows and visualize results on a tissue section
  • Annotate cell types using both spatial and gene expression information

Visium HD

Mouse Brain Visium HD Dataset

The Visium HD platform is compatible with human and mouse fresh frozen, fixed frozen, and formalin-fixed paraffin-embedded (FFPE) tissue sections. For this lesson, we will be working with data from a fresh frozen coronal section of a mouse brain sample.

Each Visium HD slide has the same 6.5 x 6.5mm capture area as previous Visium products but is covered with about 11 million tiles. These 2µm x 2µm squares are arrayed in a continuous lawn across the entire capture area. The squares are each uniquely barcoded with an oligonucleotide and contain probes allowing for the detection of the full coding transcriptome. As such, Visium HD is categorized as a sequencing-based technology.
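The "about 11 million" figure follows directly from the geometry. A quick back-of-the-envelope check (a toy calculation in Python, using only the dimensions quoted above):

```python
# Rough arithmetic behind the "~11 million tiles" figure: a continuous
# 6.5 mm x 6.5 mm lawn of 2 um x 2 um squares.
capture_side_um = 6.5 * 1000   # capture area side, in micrometers
bin_side_um = 2                # each square is 2 um on a side

bins_per_side = capture_side_um / bin_side_um   # 3250
total_bins = bins_per_side ** 2                 # 10,562,500

print(f"{int(bins_per_side)} squares per side, "
      f"~{total_bins / 1e6:.1f} million squares in total")
```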

Imaging-based Technologies

These methods use fluorescence to quantify gene expression on a tissue slide. Specifically, they use fluorescence in situ hybridization (FISH) probes to measure the expression of a panel of genes selected by the researcher. After cell segmentation, we are therefore able to evaluate expression for each individual cell.

Some popular imaging-based technologies include:

  • seqFISH
  • MERFISH
  • Xenium

Preprocessing Data with Spaceranger

Sequencing facilities often deliver scRNA-seq data, including spatial scRNA-seq data, in FASTQ format. Because this is Visium HD data from 10X Genomics, we used their proprietary pre-processing software Space Ranger to process the FASTQ files into a count matrix and accompanying images. Specifically, the spaceranger count command aligns the reads in the FASTQ files against a transcriptomic reference and assigns each read its spatial location using its oligonucleotide barcode.

Note that Space Ranger requires a Linux system with at least 32 cores, 64GB of RAM, and 1TB of disk space.

A sample command for running spaceranger count is:

spaceranger count --id=hd_count \
   --transcriptome=/path/to/refdata-gex-GRCh38-2020-A \
   --fastqs=/path/to/fastq \
   --probe-set=/path/to/Visium_Human_Transcriptome_Probe_Set_v2.0_GRCh38-2020-A.csv \
   --slide=H1-YD7CDZK \
   --area=A1 \
   --cytaimage=/path/to/CAVG10539_2023-11-16_14-56-24_APPS115_H1-YD7CDZK_A1_S11088.tif \
   --image=/path/to/APPS115_11088_rescan_01.btf \
   --create-bam=false

When spaceranger count completes successfully, it will generate a variety of outputs (seen below), which will enable the analyst to perform further analysis in R/Python or using the proprietary Loupe browser from 10X Genomics.

spaceranger output

We can view and explore the web summary HTML of our data, found in the “reports” folder of the project.

In the Visium HD assay, Space Ranger also bins the data into squares of various sizes, including:

  • 2µm x 2µm bins
  • 8µm x 8µm bins
  • 16µm x 16µm bins

The single-digit micron resolution is a big technological improvement over the original Visium’s ∼55μm spots. Having access to 2μm bins along with matching morphology information provides a great opportunity for reconstructing single cells from the data, which is undoubtedly very powerful.

However, because the 2µm x 2µm squares (and even the 8µm x 8µm bins) are so small, there is a potential for very little biological signal to be captured per bin. Additionally, the sheer number of bins at these higher resolutions can present challenges in terms of computational time and resources.
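Conceptually, the larger bins are simply sums over blocks of 2µm squares: an 8µm bin aggregates a 4x4 block, and a 16µm bin aggregates an 8x8 block. A minimal toy sketch of this aggregation in Python (not Space Ranger's actual implementation):

```python
# Toy sketch of square binning: sum counts over k x k blocks of 2 um squares.
# An 8 um bin aggregates a 4x4 block; a 16 um bin aggregates an 8x8 block.
def bin_counts(grid, k):
    """Aggregate a 2D grid of per-square counts into k x k bins (sums)."""
    n = len(grid)
    assert n % k == 0, "grid side must be divisible by bin size"
    return [
        [sum(grid[i + di][j + dj] for di in range(k) for dj in range(k))
         for j in range(0, n, k)]
        for i in range(0, n, k)
    ]

# 8x8 grid of 2 um squares, one count each -> one 16 um bin holding 64 counts
grid = [[1] * 8 for _ in range(8)]
print(bin_counts(grid, 8))   # [[64]]
print(bin_counts(grid, 4))   # 2x2 grid of 8 um bins, 16 counts each
```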

For this lesson, we will use the 16µm x 16µm bins of the cropped Visium HD slide to run locally on laptops.

NGS-based Spatial Transcriptomics Analysis workflow

The overarching steps for analyzing a sequencing-based spatial transcriptomics dataset are as follows:

Setting Up in R

Downloading the Data

For this module, we will be working within an RStudio project. In order to follow along you should have downloaded the R project.

Where the Data Exists

If you haven’t done this already, the project is located in “Dataset for workshop” -> “Day 2- NGS-based- VisiumHD” in the course DropBox.

Once downloaded, you should see a file called visiumHD_nanocourse.zip on your computer (likely, in your Downloads folder).

  1. Unzip this file. This will result in a folder of the same name.

  2. Move the folder to the location on your computer where you would like to perform the analysis.

  3. Open up the folder. The contents will look like the screenshot below:

  4. Locate the .Rproj file and double-click on it. This will open up RStudio with the “visiumHD_nanocourse” project loaded.

  5. Open a new Rscript file.

  6. Start with some comments to indicate what this file is going to contain:

# February 26th, 2025
# Spatial Transcriptomics: Session 2
  7. Save the Rscript in the code folder as visiumHD.R. Your working directory should look something like this:

Loading Libraries

Next, we will need to be sure to load the libraries that we will be using:

# Load libraries
library(tidyverse)
library(patchwork)
library(Seurat)
library(qs)
library(SeuratWrappers)
library(Banksy)
library(quadprog)
library(spacexr)

# Increase the maximum allowed size of objects passed to parallel workers
options(future.globals.maxSize = 2000000000)

Creating the Seurat Object

The Seurat object is a custom list-like object that has well-defined spaces to store specific information/data for single-cell experiments, including spatial experiments and Visium HD.

The Seurat package provides a function Load10X_Spatial() to easily create a Seurat object from the output of Space Ranger. The Load10X_Spatial function takes as input the feature matrix and the low-resolution tissue image from the output of Space Ranger and generates a Seurat object containing both gene-level counts and spatial information.

We will not have you run this code, as this can take some time and the SpaceRanger output files are quite large to share. Instead, you will load the pre-made Seurat object.

localdir <- '../path/to/spaceranger/outs/'

#Load the raw feature matrix
object <- Load10X_Spatial(data.dir = localdir,
                          filename = 'raw_feature_bc_matrix.h5',
                          bin.size = 16)

Explore the Object

Let’s read in the Seurat object and talk about some very basic slots that we will be accessing.

# Load in Seurat object
# object <- qread('data_processed/MsBrain_FF-A1_subset.qs')
# object <- UpdateSeuratObject(object)

object <- readRDS("data_processed/MsBrain_FF-A1_subset_updated.rds")

We can print the Seurat object by calling the object we created in the console:

object
An object of class Seurat 
32285 features across 43167 samples within 1 assay 
Active assay: Spatial.016um (32285 features, 0 variable features)
 1 layer present: counts
 1 spatial field of view present: slice1.016um

Now we can examine its major features, which we will add to and alter throughout the lesson:

Exercise

There are 3 things about our Seurat object printout that would be different if we were using the 8µm x 8µm binning instead of the 16µm x 16µm binning. What are these three differences?

Quality Control

The main objective of quality control is to only keep data from bins that are high-quality. This makes it so that when we cluster our bins, it is easier to identify distinct cell type populations.

In Visium HD data, the main challenge is in delineating bins that are poor quality from bins containing reads from less complex cells. If you expect a particular cell type in your dataset to be less transcriptionally active than other cell types, the bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact rather than a biological signal.

Various metrics can be used to filter low-quality bins from high-quality ones, including:

  • UMI counts per bin: Number of unique transcripts (UMIs) detected per bin. Because the bins are very small, this number is lower than what is typically observed in non-spatial scRNA-seq data.
  • Genes detected per bin: Number of unique genes detected per bin. Like the UMI counts, this number is lower than what is typically observed in non-spatial scRNA-seq data.
  • Complexity (novelty score): A measure of how diverse the captured transcripts are within a bin. If there are many captured transcripts (high nUMI) but a low number of genes, this suggests repeatedly sequencing a small set of genes (low complexity). Good-quality bins generally have a complexity score above 0.80. The complexity score is computed as: \(\text{Complexity Score} = \dfrac{\text{Number of Genes}}{\text{Number of UMIs}}\).
  • Mitochondrial counts ratio: Identifies bins with potential mitochondrial contamination from dead or dying cells. Poor-quality bins are typically defined as those with a mitochondrial ratio above 0.2, unless high mitochondrial content is expected for the sample. This ratio is computed as: \(\text{Mitochondrial Ratio} = \dfrac{\text{Number of reads aligning to mitochondrial genes}}{\text{Total reads}}\).
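To make the two derived metrics concrete, here is a toy calculation in Python (the per-bin counts below are made up purely for illustration):

```python
# Hedged example: computing the two derived QC metrics for one bin.
n_umi = 250            # UMIs detected in the bin
n_gene = 210           # unique genes detected in the same bin
mito_reads = 32        # reads aligning to mitochondrial genes
total_reads = 400      # all reads assigned to the bin

complexity = n_gene / n_umi                 # novelty score
mito_ratio = mito_reads / total_reads       # mitochondrial counts ratio

print(f"complexity = {complexity:.2f}")     # 0.84 -> above the 0.80 guideline
print(f"mito ratio = {mito_ratio:.2f}")     # 0.08 -> below the 0.2 cutoff
```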

Let’s take a quick look at the data and make a decision on whether we need to apply any filtering. We will examine the distributions of UMI counts per bin and genes detected per bin to determine reasonable thresholds for those metrics to implement during QC filtering.

Pre-filtering

To create some plots first, we will need to create a metadata object using this command:

object_meta <- object@meta.data

Now we can plot the number of UMIs (nUMI) and the number of genes (nGene) side-by-side. For both of the plots, we expect to see a bimodal distribution, with one peak representing bins containing lower-quality cells with fewer genes and UMIs and another peak representing bins containing healthy cells with more genes and UMIs. Ideally, the peak representing lower-quality and dying cells is small and the peak representing healthy cells is large.

# Create a plot for nUMI
dist_counts_before <- object_meta %>%
  ggplot(aes(x=nCount_Spatial.016um)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  xlab("Number of UMIs per bin") +
  ggtitle('Pre-QC UMIs/Bin') +
  theme(plot.title = element_text(hjust = 0.5))

# Create a plot for nGene
dist_features_before <- object_meta %>%
  ggplot(aes(x=nFeature_Spatial.016um)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  xlab("Number of genes per bin") +
  ggtitle('Pre-QC Genes/Bin') +
  theme(plot.title = element_text(hjust = 0.5))

dists_before <- dist_counts_before | dist_features_before
dists_before

Exercise

Using the distribution plots in the app below, what do you think would be good minimum thresholds for nGene and nUMI?

Post-Filtering

We will apply very minimal filtering here, with nUMI > 100 and nGene > 100. It has been shown that low expression can be biologically meaningful for spatial context so we won’t be as stringent as we normally are with scRNA-seq.

# Create a filtered object
object_filt <- subset(object, 
                      subset = (nCount_Spatial.016um > 100) & 
                               (nFeature_Spatial.016um > 100))

Now, we can create similar plots with filtered data. As expected, we see that the small left peak in the distribution has vanished, leaving the higher quality bins, which are the majority of the data.

# Create a new metadata data frame from filtered data
object_filt_meta <- object_filt@meta.data

# Plot nUMI
dist_counts_after <- object_filt_meta %>%
  ggplot(aes(x=nCount_Spatial.016um)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  xlab("Number of UMIs per bin") +
  ggtitle('PostQC UMIs/Bin') +
  theme(plot.title = element_text(hjust = 0.5))

# Plot nGene
dist_features_after <- object_filt_meta %>%
  ggplot(aes(x=nFeature_Spatial.016um)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() +
  theme_classic() +
  ylab("Cell density") +
  xlab("Number of genes per bin") +
  ggtitle('PostQC Genes/Bin') +
  theme(plot.title = element_text(hjust = 0.5))

# Combine plots side-by-side
dists_after <- dist_counts_after | dist_features_after
dists_after

Exercise

How many cells did we filter out with these thresholds?

Visualizing Counts Data

We can visualize the number of UMIs and gene counts per bin, both as a distribution and layered on top of the tissue image. Let’s start with a violin plot to look at the distribution of UMI counts and gene counts. The input is our post-filtered dataset.

# Violin plot of UMI counts
vln_counts_after <- VlnPlot(object_filt, 
                            features = "nCount_Spatial.016um", 
                            pt.size = 0, 
                            group.by = 'orig.ident') + 
  NoLegend() + scale_y_log10() + ggtitle('nUMI') + xlab('') + ylim(c(100, 15000))

# Violin plot of gene counts
vln_features_after <- VlnPlot(object_filt, 
                            features = "nFeature_Spatial.016um", 
                            pt.size = 0, 
                            group.by = 'orig.ident') + 
  NoLegend() + scale_y_log10() + ggtitle('nGene') +  xlab('') + ylim(c(100, 15000))


# Plot both side by side
vln_counts_after | vln_features_after

We see that both distributions have a similar peak but the nUMI distribution has a much longer tail. This is expected, because while the small physical size of the bins means that most genes will be detected only once or twice, a minority of bins under very transcriptionally active cells may exhibit multiple transcripts of the same gene.

Next, we can look at the same metrics and their distribution on the actual image itself. Note that many bins have very few counts, in part due to low cellular density or cell types with low complexity in certain tissue regions.

# Visualizing UMI count across the image
image_counts <- SpatialFeaturePlot(object_filt, 
                                   features = 'nCount_Spatial.016um', 
                                   pt.size.factor = 8)

# Visualizing gene count across the image
image_features <- SpatialFeaturePlot(object_filt, 
                                     features = "nFeature_Spatial.016um", 
                                     pt.size.factor = 8) 

# Plot the two side-by-side
image_counts | image_features

Normalize Data

Normalization is important in order to make expression counts comparable across genes and/or samples. We note that the best normalization methods for spatial data are still being developed and evaluated. In particular, Bhuva et al. tested a variety of normalization techniques and found that normalizing by the number of transcripts detected per bin negatively affects spatial domain identification because total detection per bin can represent real biology. We are cognizant of this, but as discussed earlier, it can be challenging to determine whether a bin has few detections because of a technical artifact or biological signal. In the absence of normalization, this lack of signal will strongly affect clustering regardless of whether it is technical or biological. For this reason, we apply a standard log-transformed library size normalization to our data.

object_filt <- NormalizeData(object_filt, assay = 'Spatial.016um')
object_filt
An object of class Seurat 
32285 features across 41818 samples within 1 assay 
Active assay: Spatial.016um (32285 features, 0 variable features)
 2 layers present: counts, data
 1 spatial field of view present: slice1.016um

And we can see that there is now a new data layer in the Seurat object.

Unsupervised Clustering

The authors of the Seurat package recommend the Seurat v5 sketch clustering workflow because it exhibits improved performance for large datasets, especially for identifying rare and spatially-restricted groups. Sketch-based analyses aim to “subsample” large datasets in a way that preserves rare populations.

We will start by defining a set of highly variable genes. Note that this is being done on all bins in our object. Using this list of genes will help us to quantify the variability and similarity between bins. Essentially, we are looking at genes with high levels of variance while also accounting for the average expression. In doing so, we get a list of genes ranked by how much they change across different cell populations.

object_filt <- FindVariableFeatures(object_filt)
object_filt
An object of class Seurat 
32285 features across 41818 samples within 1 assay 
Active assay: Spatial.016um (32285 features, 2000 variable features)
 2 layers present: counts, data
 1 spatial field of view present: slice1.016um

We can examine our Seurat object and see that FindVariableFeatures() has added 2,000 variable features.

And we can even take a look at what the top 15 most variable features are for the entire dataset. We would anticipate that these genes correspond to the cell types in our dataset.

VariableFeatures(object_filt)[1:15]
 [1] "Ttr"    "Enpp2"  "Sst"    "Igf2"   "Ecrg4"  "Ptgds"  "Npy"    "Prlr"  
 [9] "Clic6"  "Hbb-bs" "Kl"     "Igfbp2" "Tac2"   "Hba-a2" "Cnr1"  

If we read more about the gene Ttr, we can learn that it is “highly expressed in choroid plexus epithelial cells”. If we were to continue looking at the top variable genes, we would hope to find similarly elucidating functions.

Sketch Downsampling

Next, we select 10,000 cells and create a new sub-sampled “sketch” assay using the SketchData() function. The function takes a normalized single-cell dataset containing a set of variable features and returns a Seurat object with a new assay (sketch), consisting of 10,000 bins selected based on a “leverage score” computed for each bin.

The leverage score reflects the magnitude of a bin's contribution to the gene-covariance matrix, and thus its importance to the overall dataset, with bins from rare populations earning higher leverage scores. This means that the 10,000 cells selected for the sketch will oversample rare populations, retaining the biological complexity of the sample while drastically compressing the dataset.
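The idea can be sketched in miniature (a toy in Python, not Seurat's implementation): leverage scores are the squared row norms of an orthonormal basis of the data's column space, and bins are sampled with probability proportional to those scores. The two "meta-gene" columns and bin values below are invented for illustration:

```python
import random

# Toy illustration of leverage-score sampling (not Seurat's implementation).
# Rows are bins, columns are two invented "meta-genes"; the last bin has a
# distinct, rare expression pattern and should earn a high leverage score.
X = [
    [1.0, 0.0],
    [1.1, 0.0],
    [0.9, 0.0],
    [1.0, 0.1],
    [0.0, 5.0],   # rare population
]

def gram_schmidt(cols):
    """Orthonormal basis of the column space (classical Gram-Schmidt)."""
    basis = []
    for c in cols:
        v = c[:]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v])
    return basis

Q = gram_schmidt([list(c) for c in zip(*X)])
# Leverage of bin i = squared norm of row i across the orthonormal columns
leverage = [sum(q[i] ** 2 for q in Q) for i in range(len(X))]

# Sketching: sample bins with probability proportional to their leverage
random.seed(0)
sketch = random.choices(range(len(X)), weights=leverage, k=3)
print([round(l, 3) for l in leverage])   # the rare bin scores highest
```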

# we select 10,000 cells and create a new 'sketch' assay
object_filt <- SketchData(
  object = object_filt,
  assay = 'Spatial.016um',
  ncells = 10000,
  method = "LeverageScore",
  sketched.assay = "sketch"
)

object_filt
An object of class Seurat 
64570 features across 41818 samples within 2 assays 
Active assay: sketch (32285 features, 2000 variable features)
 2 layers present: counts, data
 1 other assay present: Spatial.016um
 1 spatial field of view present: slice1.016um

We see that there are four major changes that have taken place in our Seurat object:

  • The number of features in the second line has doubled, because we have added a new assay
  • Accordingly, the number of assays has increased from one to two
  • The active assay has changed from Spatial.016um to sketch
  • There is a new line listing the additional assays that exist in the Seurat object

We can also see that the leverage score has been added as a column to the metadata of our object.

View(object_filt@meta.data)

Clustering Workflow

Next, we will perform a standard clustering workflow on our sketch of 10,000 cells:

  • FindVariableFeatures(): As before, this generates a list of highly variable genes, which may be slightly different for the sketched dataset than for the full dataset.
  • ScaleData(): Highly variable genes will be confounded with the most highly expressed genes, so we need to adjust for this.
  • RunPCA(): Perform a principal component analysis using our scaled data and variable genes. This will emphasize variation in gene expression as well as similarity across bins.
  • FindNeighbors(): Determine the Euclidean distance between bins in PCA space.
  • FindClusters(): Iteratively group bins together based on neighborhood distances. Higher resolution will yield more groups.

These steps are all part of the standard single-cell RNA-seq workflow. For more detailed information on each of these steps and the considerations involved, we have a scRNA-seq workshop that details each step of the workflow shown below:

Variable Features

We calculate the HVGs to use as input for the next step, which is PCA. We can also visualize each gene’s average expression across the bins on the x-axis and variance on the y-axis.

object_filt <- FindVariableFeatures(object_filt)

# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(object_filt,
                                          assay = "sketch")
top_genes <- ranked_variable_genes[1:15]

# Plot the average expression and variance of these genes
# With labels to indicate which genes are in the top 15
p <- VariableFeaturePlot(object_filt)
LabelPoints(plot = p, points = top_genes, repel = TRUE)

Exercise

You may notice that the top variable features are slightly different from when we calculated them previously. Why might that be the case?

PCA

Principal Component Analysis (PCA) is a technique used to emphasize variation as well as similarity, and to bring out strong patterns in a dataset; it is one of the methods used for dimensionality reduction. Ultimately what we have calculated with PCA are values that can represent how similar bins are to one another.

We would expect that bins with similar gene expression will have similar PC values. For example, we would anticipate that two fibroblast cells would have comparable gene expression, which would also result in similar principal component scores. This is why many of the downstream steps in the remainder of this lesson use PCA as the input.
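To make that intuition concrete, here is a toy example (in Python, purely for illustration; RunPCA() does this at scale): two fibroblast-like bins with comparable expression of two invented "genes" receive similar scores on the leading principal component, while a neuron-like bin does not.

```python
# Toy PCA intuition: bins with similar expression get similar PC scores.
bins = {
    "fibroblast_1": [5.0, 1.0],
    "fibroblast_2": [5.2, 0.9],
    "neuron_1":     [1.0, 6.0],
}

# Center each gene across bins
vals = list(bins.values())
means = [sum(v[i] for v in vals) / len(vals) for i in range(2)]
centered = {k: [v[i] - means[i] for i in range(2)] for k, v in bins.items()}

# 2x2 covariance matrix, then leading eigenvector via power iteration
cov = [[sum(c[i] * c[j] for c in centered.values()) / (len(vals) - 1)
        for j in range(2)] for i in range(2)]
v = [1.0, 0.0]
for _ in range(50):
    w = [cov[0][0] * v[0] + cov[0][1] * v[1],
         cov[1][0] * v[0] + cov[1][1] * v[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    v = [w[0] / norm, w[1] / norm]

# Score of each bin on the leading PC: dot product with the eigenvector
scores = {k: c[0] * v[0] + c[1] * v[1] for k, c in centered.items()}
print({k: round(s, 2) for k, s in scores.items()})
```

The two fibroblast-like bins land close together on the component, and the neuron-like bin lands far away.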

object_filt <- ScaleData(object_filt)
object_filt <- RunPCA(object_filt, assay = "sketch", 
                      reduction.name = "pca.sketch")

One way of exploring the PCs is using a heatmap to visualize the most variant genes for select PCs, with the genes and bins ordered by PC scores. The idea is to look at the PCs and determine whether the genes driving them make sense for differentiating the different cell types. We would expect to see large differences between the genes with positive versus negative scores.

# Explore heatmap of PCs 
DimHeatmap(object_filt, 
           reduction = "pca.sketch",
           dims = 1:6, 
           cells = 500, 
           balanced = TRUE)

Neighborhoods and Clusters

To group our bins together, the first step is to construct a K-nearest neighbor (KNN) graph in PCA space, where edges are drawn between bins with similar feature expression patterns. Then, the edge weight between any two bins is refined based on the shared overlap in their local neighborhoods.

From these neighborhoods, the FindClusters() function will iteratively group bins together using the Louvain algorithm, where the size of the clusters is determined by the resolution parameter. Larger resolution values lead to more clusters, which is often required for big datasets. Typically you would test a variety of cluster resolutions to see which best represents the different cell states in your dataset.
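A miniature version of the two neighborhood steps (a toy in Python, not Seurat's internals): build the KNN graph from Euclidean distances, then weight edges by shared-neighbor (Jaccard) overlap. The invented coordinates stand in for per-bin PCA scores.

```python
# Toy KNN graph + shared-neighbor (Jaccard) edge weighting.
points = {
    "a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.1),   # tight group 1
    "d": (5.0, 5.0), "e": (5.1, 5.0), "f": (5.0, 5.1),   # tight group 2
}

def knn(name, k=2):
    """k nearest neighbors of a bin by squared Euclidean distance."""
    x, y = points[name]
    others = [n for n in points if n != name]
    others.sort(key=lambda n: (points[n][0] - x) ** 2 + (points[n][1] - y) ** 2)
    return set(others[:k])

neighbors = {n: knn(n) for n in points}

def jaccard(u, v):
    """Fraction of shared neighbors between two bins."""
    return len(neighbors[u] & neighbors[v]) / len(neighbors[u] | neighbors[v])

# Bins in the same tight group share neighbors; bins in different groups don't
print(jaccard("a", "b"), jaccard("a", "d"))
```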

Here we are going to use resolution = 0.65.

object_filt <- FindNeighbors(object_filt, assay = "sketch", 
                             reduction = "pca.sketch", dims = 1:50)
object_filt <- FindClusters(object_filt, 
                            cluster.name = "seurat_cluster.sketched", 
                            resolution = 0.65)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 10000
Number of edges: 401729

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9013
Number of communities: 20
Elapsed time: 0 seconds

We can also see how many bins belong to each one of our clusters:

ggplot(object_filt@meta.data) +
  geom_bar(aes(x = seurat_cluster.sketched, 
               fill = seurat_cluster.sketched)) +
  theme_bw() + NoLegend()

Exercise

Why are there so many NA values in our results?

UMAP

Finally, let’s create a UMAP using the principal components as input. UMAP is a method that aims to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space, which is useful for visualizing our newly calculated clusters. We observe good separation between groups annotated as separate clusters, which is a sign that our clustering indeed represents various cell types.

object_filt <- RunUMAP(object_filt, reduction = "pca.sketch", 
                       reduction.name = "umap.sketch", return.model = T, 
                       dims = 1:50)

# Plot UMAP
DimPlot(object_filt, reduction = "umap.sketch", label = T, 
        cols = 'polychrome') + 
  ggtitle("Sketched clustering") + 
  theme(legend.position = "none")

We can also examine our object after these manipulations and note the addition of the scale.data layer as well as the sketch PCA and UMAP dimensional reductions.

object_filt
An object of class Seurat 
64570 features across 41818 samples within 2 assays 
Active assay: sketch (32285 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: Spatial.016um
 2 dimensional reductions calculated: pca.sketch, umap.sketch
 1 spatial field of view present: slice1.016um

Project Clusters Back to Entire Dataset

Now that we have our clusters and dimensional reductions from our sketched dataset, we need to extend these to the full dataset. The ProjectData function projects all the bins in the dataset (the Spatial.016um assay) onto the sketch assay.

object_filt <- ProjectData(
  object = object_filt,
  assay = "Spatial.016um",
  full.reduction = "full.pca.sketch",
  sketched.assay = "sketch",
  sketched.reduction = "pca.sketch",
  umap.model = "umap.sketch",
  dims = 1:50,
  refdata = list(seurat_cluster.projected = "seurat_cluster.sketched")
)
object_filt
An object of class Seurat 
64570 features across 41818 samples within 2 assays 
Active assay: sketch (32285 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 other assay present: Spatial.016um
 4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
 1 spatial field of view present: slice1.016um

Using the sketch PCA and UMAP, the ProjectData function returns a Seurat object that includes:

  • Dimensional reduction (PCA) - The full.pca.sketch dimensional reduction extends the PCA reduction on the sketched cells to all bins in the dataset
  • Dimensional reduction (UMAP) - The full.umap.sketch dimensional reduction extends the UMAP reduction on the sketched cells to all bins in the dataset
  • Cluster labels - The seurat_cluster.projected column in the object metadata now labels all cells in the dataset with one of the cluster labels derived from the sketched cells

We can now see the additional full-dataset reductions in the object.

Note that a score for the projection of each bin is saved as a column in the metadata. Opening the metadata again, we can look at the seurat_cluster.sketched column and see many NA values, because clustering was only calculated for the 10,000 sketched bins. The seurat_cluster.projected column, in contrast, has a value for every bin.

View(object_filt@meta.data)

Visualizing the Projected Clusters on UMAP

We can now visualize our clusters from the projected assignments. The UMAP plot now contains more points, which is expected because we are now visualizing the full dataset rather than our 10,000 bin sketch. Nonetheless, we can see that the full dataset is still well-represented by the projected dimensional reduction and clustering.

# switch to full dataset assay
DefaultAssay(object_filt) <- "Spatial.016um"

# Change the idents to the projected cluster assignments
Idents(object_filt) <- "seurat_cluster.projected"

# Plot the UMAP
DimPlot(object_filt, reduction = "full.umap.sketch", label = T, raster = F, 
              cols = 'polychrome') +
  ggtitle("Projected clustering") + NoLegend()

Visualizing Projected Clusters on the Image

In order to see the clusters superimposed on our image we can use the SpatialDimPlot() function. We will also set the color palette and convert the cluster assignments to a factor so they are ordered numerically rather than lexicographically in the figure.

# Arrange so clusters get listed in numerical order
object_filt$seurat_cluster.projected <- object_filt$seurat_cluster.projected %>% 
  as.numeric %>% as.factor()

# Set color palette
color_pal <- Seurat::DiscretePalette(n = length(unique(object_filt$seurat_cluster.projected)),
                                    palette = "polychrome")
names(color_pal) <- sort(unique(object_filt$seurat_cluster.projected))
image_seurat_clusters <- SpatialDimPlot(object_filt, 
                                        group.by = 'seurat_cluster.projected', 
                                        pt.size.factor = 8, cols = color_pal) +
  guides(fill=guide_legend(ncol=2))

image_seurat_clusters

Spatially-informed Clustering

BANKSY is another method for performing clustering. Unlike Seurat, BANKSY takes into account not only an individual bin’s expression pattern but also the mean and the gradient of gene expression levels in a bin’s broader neighborhood. This makes it valuable for identifying and defining spatial regions of interest.

We use the RunBanksy function to create a new “BANKSY” assay based on a default of the 4,000 most highly variable features, which can be used for dimensionality reduction and clustering. Two parameters of importance are:

  • k_geom - Number of bins to consider for the local neighborhood. Larger values will yield larger domains.
  • lambda - Influence of the neighborhood. Larger values yield more spatially coherent domains. The authors recommend using 0.8 to identify broader spatial domains.

# Run Banksy
object_filt <- RunBanksy(object_filt, lambda = 0.8, verbose = T,
                         assay = 'Spatial.016um', slot = 'data', k_geom = 50)
object_filt
An object of class Seurat 
68570 features across 41818 samples within 3 assays 
Active assay: BANKSY (4000 features, 0 variable features)
 2 layers present: data, scale.data
 2 other assays present: Spatial.016um, sketch
 4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
 1 spatial field of view present: slice1.016um

We can see the new BANKSY assay in our object:

What is BANKSY doing under the hood?

BANKSY calculates two different scores:

  • Weighted mean expression of genes in a spatial neighborhood
  • “Azimuthal gabor filter”, which is the “gradient of gene expression in each cell’s neighborhood”

In doing so, BANKSY remains flexible and does not force cell types to be located near one another. In addition, the lambda value scales the weight of these matrices, giving the user flexibility in how strongly the spatial weights impact clustering.

These new values are then concatenated to the counts matrix before running the remainder of a standard scRNA-seq workflow (dimensionality reduction, clustering, integration, etc.). Therefore, the final output of this algorithm is a modulated counts matrix that includes weighted scores that correspond to spatial domains.
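The concatenation can be illustrated in miniature (a toy in Python; only the neighborhood-mean term is shown, the gradient/AGF term is omitted, and the square-root weighting is a simplification of BANKSY's actual scaling). Expression values and neighborhoods are invented:

```python
# Toy sketch of BANKSY-style feature augmentation (mean term only).
lam = 0.8   # the lambda parameter used above

expression = {              # bin -> expression of two genes
    "b1": [4.0, 0.0],
    "b2": [3.0, 1.0],
    "b3": [0.0, 5.0],
}
neighbors = {"b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2"]}

def banksy_features(bin_id):
    """Own expression concatenated with a lambda-weighted neighborhood mean."""
    own = [(1 - lam) ** 0.5 * g for g in expression[bin_id]]
    nbrs = neighbors[bin_id]
    mean = [sum(expression[n][i] for n in nbrs) / len(nbrs)
            for i in range(len(expression[bin_id]))]
    hood = [lam ** 0.5 * m for m in mean]
    return own + hood       # doubles the feature count, like the .m0 features

print([round(x, 3) for x in banksy_features("b1")])
```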

At this point, we have calculated our BANKSY matrix that includes the weighted scores based upon the lambda value. We can see this more clearly if we investigate the Features() that exist in our BANKSY assay. The first few features are our original counts matrix with genes:

Features(object_filt) %>% head()
[1] "Ttr"   "Enpp2" "Sst"   "Igf2"  "Ecrg4" "Ptgds"

We also have “new features”, which are the lambda-scaled neighborhood-weighted values, with .m0 appended to make the distinction clear.

Features(object_filt) %>% tail()
[1] "Gimap4.m0"     "Tmtc1.m0"      "Syt3.m0"       "Matn4.m0"     
[5] "AC151602.1.m0" "Prmt8.m0"     

And so with our newly calculated BANKSY matrix, we can perform a simplified clustering workflow (PCA, neighbors, clustering):

object_filt <- RunPCA(object_filt, assay = "BANKSY", 
                      reduction.name = "pca.banksy", 
                      features = rownames(object_filt), npcs = 30)
object_filt <- FindNeighbors(object_filt, reduction = "pca.banksy", 
                             dims = 1:30)
object_filt <- FindClusters(object_filt, cluster.name = "banksy_cluster",
                            resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 41818
Number of edges: 1066661

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9501
Number of communities: 27
Elapsed time: 2 seconds

Let’s visualize the BANKSY clusters alongside the Louvain clusters (seurat_cluster.projected) for a side-by-side comparison:

color_pal <- Seurat::DiscretePalette(n = length(unique(object_filt$banksy_cluster)),
                                    palette = "polychrome")
names(color_pal) <- sort(unique(object_filt$banksy_cluster))

image_banksy_clusters <- SpatialDimPlot(object_filt, group.by = "banksy_cluster",
                                        pt.size.factor = 7, cols = color_pal)

image_seurat_clusters | image_banksy_clusters

We can see that, as expected, the BANKSY clusters are more spatially restricted, or more compact, than the Seurat clusters. The BANKSY clusters are also less noisy, likely because of the smoothing effect of considering each cell’s spatial neighborhood when assigning cluster labels.

If we had run BANKSY with lambda = 0.2, as recommended for cell type clustering instead of lambda = 0.8 for spatial domain clustering, the resultant clusters would be less spatially restricted (in other words, less compact and more distributed throughout the image) and more similar to our Seurat clustering. Below is a figure using lambda = 0.2 in BANKSY rather than lambda = 0.8.
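For reference, producing that figure only requires changing the lambda argument. The sketch below works on a copy of the object (a hypothetical object name) so that the lambda = 0.8 spatial-domain results computed above are not overwritten:

```r
# Hypothetical re-run at the cell-type-level setting (lambda = 0.2),
# done on a copy so the lambda = 0.8 results are preserved
object_lam02 <- RunBanksy(object_filt, lambda = 0.2, verbose = T,
                          assay = 'Spatial.016um', slot = 'data', k_geom = 50)
```

The downstream PCA, neighbors, and clustering steps would then be repeated on this copy exactly as shown for the lambda = 0.8 run.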

Cell Type Annotation

Perhaps we are particularly interested in understanding the organization of cell types in the cortical region of the brain.

We first subset our Seurat object to this region of interest. There are multiple ways of subsetting a Seurat object to a region of interest, but here we have identified a handful of cluster numbers that appear almost exclusively in the cortical region, and we will subset the object to only include cells that are assigned these cluster numbers.

cortex <- subset(object_filt, seurat_cluster.projected %in% c(18, 19, 7, 2, 4))

color_pal <- Seurat::DiscretePalette(n = length(unique(object_filt$seurat_cluster.projected)),
                                    palette = "polychrome")
names(color_pal) <- sort(unique(object_filt$seurat_cluster.projected))
SpatialDimPlot(cortex, group.by = 'seurat_cluster.projected', 
               pt.size.factor = 8, cols = color_pal)

Differences in plots

Your colors may be different from the ones in the above figure.

To perform accurate annotation of cell types, we must also take into consideration that our 16µm x 16µm bins may contain one or more cells each. The method Robust Cell Type Deconvolution (RCTD) has been shown to accurately annotate spatial data from a variety of technologies while taking into consideration that a single bin may exhibit multiple cell type profiles.
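To build intuition for what deconvolution means here, the toy example below models a bin’s counts as a weighted mixture of two made-up reference profiles and recovers the weights with ordinary least squares. This is only an illustration: RCTD itself fits a supervised, Poisson-based likelihood model with platform-effect correction, not OLS.

```r
# Two made-up reference profiles (genes x cell types)
profiles <- cbind(neuron = c(10, 1, 5),
                  astro  = c(1, 8, 2))

# A "bin" whose counts are a 70/30 mixture of the two profiles
bin <- 0.7 * profiles[, "neuron"] + 0.3 * profiles[, "astro"]

# Recover the mixture weights by least squares (illustration only;
# RCTD uses a likelihood model instead)
w <- solve(t(profiles) %*% profiles, t(profiles) %*% bin)
round(w / sum(w), 2)   # neuron ~0.70, astro ~0.30
```

In doublet mode, RCTD additionally decides whether one or two cell types are sufficient to explain each bin, which is what produces the spot_class labels we will see later.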

RCTD takes a cell-type-annotated scRNA-seq dataset as a reference and a spatial dataset as a query. For our reference, we will use a subsampled version of the mouse scRNA-seq dataset from the Allen Brain Atlas. We will use our cortex Seurat object as the spatial query. As an overview, the process is as follows:

  1. Sketch and process the spatial query dataset
  2. Load and format the scRNA-seq reference dataset
  3. Apply RCTD to deconvolute the “sketched” cortical cells and annotate them
  4. Project these annotations onto the full cortical dataset.

1) Sketch and process the spatial query dataset

Because we have subset the dataset, we need to re-run the typical scRNA-seq workflow on the cortex alone. Restricting the analysis to these bins lets us pick up more subtle expression shifts among them, since the weights from the rest of the dataset no longer factor in.

# Create sketch of cortex subset
DefaultAssay(cortex) <- 'Spatial.016um'
cortex <- FindVariableFeatures(cortex)
cortex <- SketchData(
  object = cortex,
  ncells = 3000,
  method = "LeverageScore",
  sketched.assay = "sketch"
)

# Run through scRNA workflow
DefaultAssay(cortex) <- "sketch"
cortex <- ScaleData(cortex)
cortex <- RunPCA(cortex, assay = "sketch", 
                 reduction.name = "pca.cortex.sketch", verbose = T)
cortex <- FindNeighbors(cortex, reduction = "pca.cortex.sketch", dims = 1:50)
cortex <- RunUMAP(cortex, reduction = "pca.cortex.sketch", 
                  reduction.name = "umap.cortex.sketch", return.model = T, 
                  dims = 1:50, verbose = T)

RCTD requires its own data structure (analogous to a Seurat object) as input. We create our query object with SpatialRNA() by supplying the spatial coordinates and raw counts for our cells.

# Create the RCTD query object
# Only using sketched cells
counts_hd <- cortex[["sketch"]]$counts
cortex_cells_hd <- colnames(cortex[["sketch"]])
coords <- GetTissueCoordinates(cortex)[cortex_cells_hd, 1:2]
query <- SpatialRNA(coords, counts_hd, colSums(counts_hd))
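Before moving on, it can be worth a quick sanity check that the query holds what we expect. The sketch below assumes the slot names used by the spacexr SpatialRNA class:

```r
# Sanity checks on the query object (assumes spacexr's slot names)
dim(query@counts)    # genes x sketched cells
head(query@coords)   # x/y positions that RCTD will use
```

The number of columns in the counts slot should match the number of sketched cells (3000, as requested in SketchData above).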

2) Load and format the reference dataset

After loading our reference dataset, we can take a quick look at the different cell types that we will use to annotate our query object. These labels are stored in the subclass_label column.

# Increase amount of memory R can use
mem.maxVSize(15000)
[1] 15000
ref_subset <- qread("data_processed/allen_scRNAseq_ref_subset.qs")

ggplot(ref_subset@meta.data) +
  geom_bar(aes(x = subclass_label, 
               fill = subclass_label)) +
  theme_bw() + NoLegend() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

We are going to make another RCTD object out of the reference dataset. This time we use the Reference() function and additionally supply cluster (cell type) information to transfer to our query.

Idents(ref_subset) <- "subclass_label"
counts <- ref_subset[["RNA"]]$counts
cluster <- as.factor(ref_subset$subclass_label)
nUMI <- ref_subset$nCount_RNA
levels(cluster) <- gsub("/", "-", levels(cluster))
cluster <- droplevels(cluster)

# create the RCTD reference object
reference <- Reference(counts, cluster, nUMI)
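As a quick check that the reference was built as intended, we can tally how many reference cells back each label. This sketch assumes the cell_types slot name from the spacexr Reference class:

```r
# Number of reference cells available per cell type label
# (assumes spacexr's cell_types slot)
table(reference@cell_types)
```

Cell types with very few reference cells will be harder for RCTD to detect reliably in the query.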

3) Apply RCTD to deconvolute the “sketched” cortical cells and annotate them

Note that run.RCTD takes 10-15 minutes to complete on a laptop using 6 cores

# run RCTD
RCTD <- create.RCTD(query, reference, max_cores = 6)
RCTD <- run.RCTD(RCTD, doublet_mode = "doublet") # this command takes ~15 mins to run

# add results back to Seurat object
cortex <- AddMetaData(cortex, metadata = RCTD@results$results_df)

The resultant dataframe from RCTD will contain the following columns according to the documentation:

  • spot_class, a factor variable representing RCTD’s classification in doublet mode:
    • “singlet” (1 cell type on pixel)
    • “doublet_certain” (2 cell types on pixel)
    • “doublet_uncertain” (2 cell types on pixel, but only confident of 1)
    • “reject” (no prediction given for pixel)
  • first_type column gives the first cell type predicted on the bead (for all spot_class conditions except “reject”).
  • second_type column gives the second cell type predicted on the bead for doublet spot_class conditions (not a confident prediction for “doublet_uncertain”).

Which we can access from our cortex object:

cols <- c("spot_class", "first_type", "second_type")
cortex@meta.data[cols] %>%
  subset(!is.na(spot_class)) %>%
  View()
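A quick tally of spot_class gives a sense of how many bins RCTD could confidently deconvolute. This is a sketch on the cortex object from above; the NA entries are bins that were not part of the sketch:

```r
# Count sketched bins per RCTD classification
# (NA = bins not included in the sketch)
table(cortex$spot_class, useNA = "ifany")
```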

4) Project RCTD labels onto all cortical cells

cortex$first_type <- as.character(cortex$first_type)
cortex$first_type[is.na(cortex$first_type)] <- "Unknown"
cortex <- ProjectData(
  object = cortex,
  assay = "Spatial.016um",
  full.reduction = "pca.cortex",
  sketched.assay = "sketch",
  sketched.reduction = "pca.cortex.sketch",
  umap.model = "umap.cortex.sketch",
  dims = 1:50,
  refdata = list(full_first_type = "first_type")
)
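After projection, every cortical bin carries a full_first_type label. A quick tabulation (a sketch on the cortex object) shows which cell types dominate the region:

```r
# Most common projected cell type labels across all cortical bins
sort(table(cortex$full_first_type), decreasing = TRUE) %>% head()
```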

We can see that the excitatory neurons are located in layers at varying cortical depths, as expected.

Idents(cortex) <- "full_first_type"
cells <- CellsByIdentities(cortex)

# Layered (starts with L), excitatory neurons in the cortex
excitatory_names <- sort(grep("^L.* CTX", names(cells), value = TRUE))
SpatialDimPlot(cortex, cells.highlight = cells[excitatory_names], 
               cols.highlight = c("#FFFF00", "grey50"), facet.highlight = T, 
               combine = T, ncol = 4, pt.size.factor = 8)

Save R Session

At the end of any analysis, it is important to save your session information! This will let future users know which versions of packages are necessary in order to reproduce the same results.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future_1.68.0        DT_0.34.0            spacexr_2.2.1       
 [4] quadprog_1.5-8       Banksy_1.5.10        SeuratWrappers_0.4.0
 [7] qs_0.27.3            Seurat_5.4.0         SeuratObject_5.3.0  
[10] sp_2.2-0             patchwork_1.3.2      lubridate_1.9.4     
[13] forcats_1.0.1        stringr_1.6.0        dplyr_1.1.4         
[16] purrr_1.2.1          readr_2.1.6          tidyr_1.3.2         
[19] tibble_3.3.1         ggplot2_4.0.1        tidyverse_2.0.0     

loaded via a namespace (and not attached):
  [1] RcppHungarian_0.3           RcppAnnoy_0.0.23           
  [3] splines_4.4.1               later_1.4.5                
  [5] R.oo_1.27.1                 polyclip_1.10-7            
  [7] fastDummies_1.7.5           lifecycle_1.0.5            
  [9] aricode_1.0.3               doParallel_1.0.17          
 [11] globals_0.18.0              lattice_0.22-7             
 [13] MASS_7.3-65                 crosstalk_1.2.2            
 [15] magrittr_2.0.4              sass_0.4.10                
 [17] plotly_4.11.0               rmarkdown_2.30             
 [19] jquerylib_0.1.4             yaml_2.3.12                
 [21] remotes_2.5.0               httpuv_1.6.16              
 [23] otel_0.2.0                  sctransform_0.4.3          
 [25] spam_2.11-3                 spatstat.sparse_3.1-0      
 [27] reticulate_1.44.1           cowplot_1.2.0              
 [29] pbapply_1.7-4               RColorBrewer_1.1-3         
 [31] abind_1.4-8                 zlibbioc_1.52.0            
 [33] Rtsne_0.17                  GenomicRanges_1.58.0       
 [35] R.utils_2.13.0              BiocGenerics_0.52.0        
 [37] GenomeInfoDbData_1.2.13     IRanges_2.40.1             
 [39] S4Vectors_0.44.0            ggrepel_0.9.6              
 [41] irlba_2.3.5.1               listenv_0.10.0             
 [43] spatstat.utils_3.2-1        goftest_1.2-3              
 [45] RSpectra_0.16-2             spatstat.random_3.4-3      
 [47] fitdistrplus_1.2-4          parallelly_1.46.1          
 [49] codetools_0.2-20            DelayedArray_0.32.0        
 [51] RApiSerialize_0.1.4         tidyselect_1.2.1           
 [53] UCSC.utils_1.2.0            farver_2.1.2               
 [55] matrixStats_1.5.0           stats4_4.4.1               
 [57] spatstat.explore_3.6-0      jsonlite_2.0.0             
 [59] progressr_0.18.0            iterators_1.0.14           
 [61] ggridges_0.5.7              survival_3.8-3             
 [63] foreach_1.5.2               dbscan_1.2.4               
 [65] tools_4.4.1                 ica_1.0-3                  
 [67] Rcpp_1.1.1                  glue_1.8.0                 
 [69] gridExtra_2.3               SparseArray_1.6.2          
 [71] xfun_0.55                   MatrixGenerics_1.18.1      
 [73] GenomeInfoDb_1.42.3         withr_3.0.2                
 [75] BiocManager_1.30.27         fastmap_1.2.0              
 [77] digest_0.6.39               rsvd_1.0.5                 
 [79] timechange_0.3.0            R6_2.6.1                   
 [81] mime_0.13                   scattermore_1.2            
 [83] sccore_1.0.6                tensor_1.5.1               
 [85] dichromat_2.0-0.1           spatstat.data_3.1-9        
 [87] R.methodsS3_1.8.2           generics_0.1.4             
 [89] data.table_1.18.0           httr_1.4.7                 
 [91] htmlwidgets_1.6.4           S4Arrays_1.6.0             
 [93] uwot_0.2.4                  pkgconfig_2.0.3            
 [95] gtable_0.3.6                lmtest_0.9-40              
 [97] S7_0.2.1                    SingleCellExperiment_1.28.1
 [99] XVector_0.46.0              htmltools_0.5.9            
[101] dotCall64_1.2               scales_1.4.0               
[103] Biobase_2.66.0              png_0.1-8                  
[105] SpatialExperiment_1.16.0    spatstat.univar_3.1-5      
[107] knitr_1.51                  rstudioapi_0.17.1          
[109] tzdb_0.5.0                  reshape2_1.4.5             
[111] rjson_0.2.23                nlme_3.1-168               
[113] cachem_1.1.0                zoo_1.8-15                 
[115] KernSmooth_2.23-26          vipor_0.4.7                
[117] parallel_4.4.1              miniUI_0.1.2               
[119] ggrastr_1.0.2               pillar_1.11.1              
[121] grid_4.4.1                  vctrs_0.6.5                
[123] RANN_2.6.2                  promises_1.5.0             
[125] stringfish_0.17.0           xtable_1.8-4               
[127] cluster_2.1.8.1             beeswarm_0.4.0             
[129] evaluate_1.0.5              magick_2.9.0               
[131] cli_3.6.5                   compiler_4.4.1             
[133] rlang_1.1.7                 crayon_1.5.3               
[135] future.apply_1.20.1         labeling_0.4.3             
[137] mclust_6.1.2                ggbeeswarm_0.7.3           
[139] plyr_1.8.9                  stringi_1.8.7              
[141] viridisLite_0.4.2           deldir_2.0-4               
[143] lazyeval_0.2.2              spatstat.geom_3.6-1        
[145] Matrix_1.7-4                RcppHNSW_0.6.0             
[147] hms_1.1.4                   shiny_1.12.1               
[149] SummarizedExperiment_1.36.0 ROCR_1.0-11                
[151] leidenAlg_1.1.5             igraph_2.2.1               
[153] bslib_0.9.0                 RcppParallel_5.1.11-1